adjusted_rand_score (Adjusted Rand Index / ARI)#
The Adjusted Rand Index (ARI) is an external clustering metric: it compares two partitions of the same \(n\) samples (e.g., ground truth labels vs a clustering algorithm’s output).
\(1.0\) means the two partitions are identical (up to a permutation of label names).
\(0.0\) is the score you expect by chance under a standard random-labeling model.
Negative values mean worse-than-chance agreement.
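These endpoint behaviors are easy to check directly (a minimal sketch with scikit-learn; both values follow from the pair-counting formula derived later in this section):

```python
from sklearn.metrics import adjusted_rand_score

# Identical partition up to renaming the cluster IDs -> 1.0
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

# "Anti-aligned" partition: every same-cluster pair is split -> worse than chance
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```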
Quick import (scikit-learn)#
from sklearn.metrics import adjusted_rand_score
Goals#
Build intuition via pair agreements (the Rand index).
See why “adjusting for chance” matters.
Derive the ARI formula using a contingency table.
Implement ARI from scratch in NumPy and verify it against sklearn.
Use ARI to tune/compare a simple clustering algorithm on labeled benchmark data.
When it’s useful#
Benchmarking a clustering algorithm against known labels (synthetic or labeled datasets)
Comparing two clustering outputs (agreement / stability)
Prerequisites#
Basic clustering vocabulary (cluster labels as a partition)
Combinatorics: number of unordered pairs \(\binom{n}{2}\)
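The pair count \(\binom{n}{2}\) comes up constantly below; the standard library computes it directly:

```python
from math import comb

n = 6
print(comb(n, 2))  # 15 unordered pairs among 6 samples
```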
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import adjusted_rand_score
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Intuition: clustering as “pair decisions” (Rand index)#
Any clustering/labeling answers this yes/no question for every unordered pair of samples \((i, j)\):
Are \(i\) and \(j\) in the same cluster?
Given two partitions (say true labels and predicted labels), every pair falls into one of four buckets:
\(a\): same in true and same in pred
\(b\): different in true and different in pred
\(c\): same in true, different in pred
\(d\): different in true, same in pred
The Rand Index (RI) is the fraction of agreeing decisions:
\[\mathrm{RI} = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}\]
RI is easy to understand, but it has an issue: it can be large even for random clusterings.
def pair_confusion_counts(labels_true: np.ndarray, labels_pred: np.ndarray) -> dict[str, int]:
labels_true = np.asarray(labels_true)
labels_pred = np.asarray(labels_pred)
if labels_true.shape != labels_pred.shape:
raise ValueError("labels_true and labels_pred must have the same shape")
n = labels_true.size
same_true = labels_true[:, None] == labels_true[None, :]
same_pred = labels_pred[:, None] == labels_pred[None, :]
upper = np.triu(np.ones((n, n), dtype=bool), k=1)
a = int(np.sum(same_true & same_pred & upper))
b = int(np.sum(~same_true & ~same_pred & upper))
c = int(np.sum(same_true & ~same_pred & upper))
d = int(np.sum(~same_true & same_pred & upper))
return {"a_same_same": a, "b_diff_diff": b, "c_same_diff": c, "d_diff_same": d}
# A tiny example
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2])
counts = pair_confusion_counts(y_true, y_pred)
n_pairs = y_true.size * (y_true.size - 1) // 2
ri = (counts["a_same_same"] + counts["b_diff_diff"]) / n_pairs
counts, n_pairs, ri
({'a_same_same': 2, 'b_diff_diff': 9, 'c_same_diff': 2, 'd_diff_same': 2},
15,
0.7333333333333333)
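As a cross-check, scikit-learn exposes the same four counts via `sklearn.metrics.cluster.pair_confusion_matrix`. Its entries count ordered pairs, so each is twice the corresponding unordered count above:

```python
import numpy as np
from sklearn.metrics.cluster import pair_confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2])

# Layout: C[1, 1] = 2a (same/same), C[0, 0] = 2b (diff/diff),
#         C[1, 0] = 2c (same in true, diff in pred), C[0, 1] = 2d
C = pair_confusion_matrix(y_true, y_pred)
print(C)  # [[18  4]
          #  [ 4  4]]
```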
2) Why “adjust for chance”?#
In many datasets, most pairs are in different clusters. If two random partitions happen to put lots of pairs into “different/different”, RI becomes large — even though the clusterings are not meaningfully similar.
ARI fixes this by applying the classic “adjusted-for-chance” idea:
\[\text{AdjustedIndex} = \frac{\text{Index} - \mathbb{E}[\text{Index}]}{\text{MaxIndex} - \mathbb{E}[\text{Index}]}\]
The remaining question is: what exactly is “Index” here, and how do we compute its expectation efficiently?
3) The contingency table view (efficient computation)#
Let the true partition be \(U = \{U_1, \dots, U_r\}\) and the predicted partition be \(V = \{V_1, \dots, V_s\}\).
Define the contingency table counts \(n_{ij} = |U_i \cap V_j|\), with row sums \(a_i = \sum_j n_{ij}\) and column sums \(b_j = \sum_i n_{ij}\).
Think in terms of unordered pairs. The number of pairs that are in the same cluster in both partitions is:
\[\sum_{i,j} \binom{n_{ij}}{2}.\]
The number of pairs that are in the same true cluster (regardless of prediction) is:
\[\sum_i \binom{a_i}{2},\]
and similarly for the predicted clusters:
\[\sum_j \binom{b_j}{2}.\]
Under the standard “random labeling with fixed cluster sizes” model (hypergeometric), the expected index becomes:
\[\mathbb{E}\!\left[\sum_{i,j} \binom{n_{ij}}{2}\right] = \frac{\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}}{\binom{n}{2}}.\]
The adjusted Rand index is then (Hubert & Arabie, 1985):
\[\mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \dfrac{\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}}{\binom{n}{2}}}{\dfrac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \dfrac{\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}}{\binom{n}{2}}}.\]
This avoids the \(O(n^2)\) pair enumeration and instead works in roughly \(O(n + rs)\).
def comb2(x: np.ndarray | int) -> np.ndarray | float:
x_arr = np.asarray(x, dtype=np.int64)
return x_arr * (x_arr - 1) / 2.0
def contingency_matrix_numpy(
labels_true: np.ndarray,
labels_pred: np.ndarray,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
labels_true = np.asarray(labels_true)
labels_pred = np.asarray(labels_pred)
if labels_true.shape != labels_pred.shape:
raise ValueError("labels_true and labels_pred must have the same shape")
true_labels, true_inv = np.unique(labels_true, return_inverse=True)
pred_labels, pred_inv = np.unique(labels_pred, return_inverse=True)
cont = np.zeros((true_labels.size, pred_labels.size), dtype=np.int64)
np.add.at(cont, (true_inv, pred_inv), 1)
return cont, true_labels, pred_labels
def rand_index_numpy(labels_true: np.ndarray, labels_pred: np.ndarray) -> float:
labels_true = np.asarray(labels_true)
labels_pred = np.asarray(labels_pred)
n = labels_true.size
if n < 2:
return 1.0
cont, _, _ = contingency_matrix_numpy(labels_true, labels_pred)
sum_nij = float(np.sum(comb2(cont)))
sum_ai = float(np.sum(comb2(cont.sum(axis=1))))
sum_bj = float(np.sum(comb2(cont.sum(axis=0))))
n_pairs = float(comb2(n))
a = sum_nij
c = sum_ai - a
d = sum_bj - a
b = n_pairs - a - c - d
return (a + b) / n_pairs
def adjusted_rand_index_numpy(labels_true: np.ndarray, labels_pred: np.ndarray) -> float:
labels_true = np.asarray(labels_true)
labels_pred = np.asarray(labels_pred)
n = labels_true.size
if n < 2:
return 1.0
cont, _, _ = contingency_matrix_numpy(labels_true, labels_pred)
sum_nij = float(np.sum(comb2(cont)))
sum_ai = float(np.sum(comb2(cont.sum(axis=1))))
sum_bj = float(np.sum(comb2(cont.sum(axis=0))))
n_pairs = float(comb2(n))
expected = (sum_ai * sum_bj) / n_pairs
max_index = 0.5 * (sum_ai + sum_bj)
denom = max_index - expected
if denom == 0.0:
return 1.0
return (sum_nij - expected) / denom
# Quick check: pair-based RI agrees with the contingency-based RI
ri_pairs = (counts["a_same_same"] + counts["b_diff_diff"]) / n_pairs
ri_cont = rand_index_numpy(y_true, y_pred)
ri_pairs, ri_cont
(0.7333333333333333, 0.7333333333333333)
# Sanity checks + sklearn parity
tests = [
(np.array([0]), np.array([1])),
(np.array([0, 1]), np.array([1, 0])), # label permutation
(np.array([0, 0, 0]), np.array([1, 1, 1])), # one cluster vs one cluster
(np.array([0, 0, 0]), np.array([0, 1, 2])), # one cluster vs all singletons
(np.array([0, 1, 2]), np.array([2, 1, 0])), # all singletons vs all singletons
(y_true, y_pred),
]
for yt, yp in tests:
ari_np = adjusted_rand_index_numpy(yt, yp)
ri_np = rand_index_numpy(yt, yp)
ari_sk = float(adjusted_rand_score(yt, yp))
print(f"ARI numpy={ari_np: .6f} sklearn={ari_sk: .6f} | RI={ri_np: .6f}")
ARI numpy= 1.000000 sklearn= 1.000000 | RI= 1.000000
ARI numpy= 1.000000 sklearn= 1.000000 | RI= 1.000000
ARI numpy= 1.000000 sklearn= 1.000000 | RI= 1.000000
ARI numpy= 0.000000 sklearn= 0.000000 | RI= 0.000000
ARI numpy= 1.000000 sklearn= 1.000000 | RI= 1.000000
ARI numpy= 0.318182 sklearn= 0.318182 | RI= 0.733333
4) ARI vs RI under random labelings (chance correction in action)#
Below, we keep a fixed “true” clustering, then generate random predicted labels with different numbers of clusters.
RI tends to increase when you have many clusters, because “different/different” pairs dominate.
ARI stays centered around 0, because it subtracts the expected agreement under the random model.
n = 200
# A fixed reference partition (imbalanced on purpose)
y_ref = np.repeat([0, 1, 2], [120, 60, 20])
rng.shuffle(y_ref)
ks = [2, 3, 5, 10, 20]
n_trials = 150
ri_by_k = {k: [] for k in ks}
ari_by_k = {k: [] for k in ks}
for k in ks:
for _ in range(n_trials):
y_rand = rng.integers(0, k, size=n)
ri_by_k[k].append(rand_index_numpy(y_ref, y_rand))
ari_by_k[k].append(adjusted_rand_index_numpy(y_ref, y_rand))
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=(
"Rand Index (RI) under random predictions",
"Adjusted Rand Index (ARI) under random predictions",
),
)
for k in ks:
fig.add_trace(go.Box(y=ri_by_k[k], name=f"k={k}", boxmean=True, showlegend=False), row=1, col=1)
fig.add_trace(go.Box(y=ari_by_k[k], name=f"k={k}", boxmean=True, showlegend=False), row=1, col=2)
fig.update_yaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="score", row=1, col=2)
fig.update_layout(title="Chance correction: RI vs ARI", height=420)
fig.show()
5) Visual intuition on 2D data#
ARI ignores the geometry of points — it only looks at the partition. Still, plotting helps connect “good/bad partitions” with the score.
# Synthetic 2D blobs (NumPy only)
centers = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
std = 0.6
n_per = 150
X = np.vstack([rng.normal(loc=c, scale=std, size=(n_per, 2)) for c in centers])
y_true_vis = np.repeat(np.arange(len(centers)), n_per)
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=y_true_vis.astype(str),
title="Ground truth partition (for illustration)",
labels={"x": "x1", "y": "x2", "color": "true cluster"},
)
fig.update_traces(marker=dict(size=6, opacity=0.85))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
def corrupt_labels(labels: np.ndarray, frac: float, rng: np.random.Generator) -> np.ndarray:
labels = np.asarray(labels)
if not (0.0 <= frac <= 1.0):
raise ValueError("frac must be in [0, 1]")
unique = np.unique(labels)
if unique.size < 2:
return labels.copy()
n = labels.size
m = int(round(frac * n))
idx = rng.choice(n, size=m, replace=False)
out = labels.copy()
for i in idx:
current = out[i]
choices = unique[unique != current]
out[i] = rng.choice(choices)
return out
# Four different predicted partitions
y_pred_perfect = y_true_vis.copy()
perm = {0: 2, 1: 0, 2: 1}
y_pred_permuted = np.vectorize(perm.get)(y_true_vis)
y_pred_noisy = corrupt_labels(y_true_vis, frac=0.15, rng=rng)
y_pred_random = rng.integers(0, 3, size=y_true_vis.size)
preds = [
("perfect match", y_pred_perfect),
("label permutation", y_pred_permuted),
("15% label noise", y_pred_noisy),
("random labels", y_pred_random),
]
fig = make_subplots(
rows=2,
cols=2,
subplot_titles=[
f"{name}<br>ARI={adjusted_rand_index_numpy(y_true_vis, yp):.3f}"
for name, yp in preds
],
shared_xaxes=True,
shared_yaxes=True,
)
palette = px.colors.qualitative.Safe
for idx, (name, yp) in enumerate(preds):
row = 1 + idx // 2
col = 1 + idx % 2
for j, lab in enumerate(np.unique(yp)):
mask = yp == lab
fig.add_trace(
go.Scatter(
x=X[mask, 0],
y=X[mask, 1],
mode="markers",
marker=dict(size=5, opacity=0.85, color=palette[j % len(palette)]),
name=str(lab),
showlegend=(idx == 0),
),
row=row,
col=col,
)
fig.update_xaxes(title_text="x1")
fig.update_yaxes(title_text="x2", scaleanchor="x", scaleratio=1)
fig.update_layout(height=650, title="Same data, different partitions")
fig.show()
6) Contingency table: what ARI “sees”#
ARI only uses the contingency counts \(n_{ij}\) between two labelings. A useful diagnostic is to visualize this table.
cont, t_labels, p_labels = contingency_matrix_numpy(y_true_vis, y_pred_noisy)
fig = px.imshow(
cont,
text_auto=True,
color_continuous_scale="Blues",
labels={"x": "pred cluster", "y": "true cluster", "color": "count"},
x=p_labels.astype(str),
y=t_labels.astype(str),
title="Contingency matrix (true vs predicted)",
)
fig.update_layout(height=420)
fig.show()
sum_nij = float(np.sum(comb2(cont)))
sum_ai = float(np.sum(comb2(cont.sum(axis=1))))
sum_bj = float(np.sum(comb2(cont.sum(axis=0))))
n_pairs = float(comb2(cont.sum()))
sum_nij, sum_ai, sum_bj, n_pairs
(24498.0, 33525.0, 33538.0, 101025.0)
7) Sensitivity: ARI vs “label noise”#
If we take a reference partition and randomly reassign a fraction of points to different clusters, ARI should smoothly decrease toward \(0\) and sometimes go negative.
fracs = np.linspace(0, 0.8, 17)
n_trials = 80
means = []
stds = []
for frac in fracs:
vals = [adjusted_rand_index_numpy(y_true_vis, corrupt_labels(y_true_vis, frac, rng)) for _ in range(n_trials)]
means.append(float(np.mean(vals)))
stds.append(float(np.std(vals)))
means = np.array(means)
stds = np.array(stds)
fig = go.Figure()
fig.add_trace(go.Scatter(x=fracs, y=means, mode="lines+markers", name="mean ARI"))
fig.add_trace(
go.Scatter(
x=np.concatenate([fracs, fracs[::-1]]),
y=np.concatenate([means - stds, (means + stds)[::-1]]),
fill="toself",
fillcolor="rgba(0,0,0,0.12)",
line=dict(color="rgba(0,0,0,0)"),
name="±1 std",
)
)
fig.update_layout(
title="ARI vs fraction of corrupted labels",
xaxis_title="fraction corrupted",
yaxis_title="ARI",
height=420,
)
fig.show()
8) Using ARI to “optimize” a simple algorithm (model selection)#
ARI requires reference labels (it’s an external metric), so it’s typically used for:
benchmarking clustering algorithms on labeled datasets
tuning hyperparameters (e.g., number of clusters \(k\)) when you do have labels (synthetic data, weak labels, semi-supervised)
Below we implement a tiny k-means from scratch and choose \(k\) by maximizing ARI against the known labels.
def kmeans_single_run(
X: np.ndarray,
k: int,
n_iters: int,
rng: np.random.Generator,
) -> tuple[np.ndarray, np.ndarray, float]:
X = np.asarray(X, dtype=float)
n = X.shape[0]
if k <= 0 or k > n:
raise ValueError("k must satisfy 1 <= k <= n_samples")
init_idx = rng.choice(n, size=k, replace=False)
centers = X[init_idx].copy()
for _ in range(n_iters):
d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
labels = np.argmin(d2, axis=1)
new_centers = centers.copy()
for j in range(k):
mask = labels == j
if np.any(mask):
new_centers[j] = X[mask].mean(axis=0)
if np.allclose(new_centers, centers):
centers = new_centers
break
centers = new_centers
inertia = float(np.sum((X - centers[labels]) ** 2))
return labels, centers, inertia
def kmeans_from_scratch(
X: np.ndarray,
k: int,
n_init: int = 10,
n_iters: int = 50,
rng: np.random.Generator | None = None,
) -> tuple[np.ndarray, np.ndarray, float]:
if rng is None:
rng = np.random.default_rng(0)
best_labels = None
best_centers = None
best_inertia = np.inf
for _ in range(n_init):
labels, centers, inertia = kmeans_single_run(X, k=k, n_iters=n_iters, rng=rng)
if inertia < best_inertia:
best_labels, best_centers, best_inertia = labels, centers, inertia
return best_labels, best_centers, float(best_inertia)
# Choose k by maximizing ARI on this labeled benchmark
k_grid = range(2, 7)
results = []
for k in k_grid:
labels_k, centers_k, inertia_k = kmeans_from_scratch(X, k=k, n_init=15, n_iters=60, rng=rng)
ari_k = adjusted_rand_index_numpy(y_true_vis, labels_k)
results.append((k, ari_k, inertia_k))
results
[(2, 0.5606692408470919, 1250.1620852830713),
(3, 1.0, 291.49398337816893),
(4, 0.8667985639192363, 253.34081970634463),
(5, 0.7233392630713212, 215.61620906403215),
(6, 0.5838640527190261, 189.80284588981752)]
ks = np.array([r[0] for r in results])
aris = np.array([r[1] for r in results])
inertias = np.array([r[2] for r in results])
fig = make_subplots(rows=1, cols=2, subplot_titles=("ARI vs k", "Inertia vs k (k-means objective)"))
fig.add_trace(go.Scatter(x=ks, y=aris, mode="lines+markers", name="ARI"), row=1, col=1)
fig.add_trace(go.Scatter(x=ks, y=inertias, mode="lines+markers", name="inertia", showlegend=False), row=1, col=2)
fig.update_xaxes(title_text="k", row=1, col=1)
fig.update_xaxes(title_text="k", row=1, col=2)
fig.update_yaxes(title_text="ARI", row=1, col=1)
fig.update_yaxes(title_text="inertia (SSE)", row=1, col=2)
fig.update_layout(height=420, title="Using ARI for model selection (requires labels)")
fig.show()
best_idx = int(np.argmax(aris))
best_k = int(ks[best_idx])
best_k
3
best_labels, best_centers, best_inertia = kmeans_from_scratch(X, k=best_k, n_init=30, n_iters=80, rng=rng)
best_ari = adjusted_rand_index_numpy(y_true_vis, best_labels)
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=best_labels.astype(str),
title=f"Best k by ARI: k={best_k} (ARI={best_ari:.3f})",
labels={"x": "x1", "y": "x2", "color": "pred cluster"},
)
fig.add_trace(
go.Scatter(
x=best_centers[:, 0],
y=best_centers[:, 1],
mode="markers",
marker=dict(symbol="x", size=14, color="black"),
name="centers",
)
)
fig.update_traces(marker=dict(size=6, opacity=0.85), selector=dict(mode="markers"))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
Pros and cons#
Pros
Label permutation invariant (cluster IDs don’t matter)
Chance-corrected: random partitions have expected score near \(0\)
Works when the number of clusters differs between partitions
Interpretable endpoints: \(1\) = identical partitions
Cons / limitations
Requires reference labels (not usable for pure unsupervised model selection)
Based on pair counting: can be dominated by large clusters; treats all pair errors equally
Ignores geometry/distances (two very different geometric clusterings can get the same ARI if labels match)
Not a smooth/differentiable objective → generally not used directly in gradient-based optimization
Common pitfalls#
Using ARI without ground truth: prefer internal metrics (silhouette, Davies–Bouldin, etc.) when labels are absent.
Confusing “label names” with clusters: ARI doesn’t care about the numeric IDs (0/1/2), only the partition structure.
Naive \(O(n^2)\) implementation: pair enumeration is slow for large \(n\); prefer the contingency-table formula.
Interpreting negative scores: ARI < 0 means worse-than-chance agreement under the reference random model.
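Two of these pitfalls can be verified in a couple of lines (a small sketch; the \(-0.5\) value is the same anti-aligned example from the pair-counting formula):

```python
from sklearn.metrics import adjusted_rand_score

# Cluster IDs are arbitrary: renaming 0/1 to "a"/"b" changes nothing
print(adjusted_rand_score([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0

# Negative ARI: systematic disagreement, worse than the random baseline
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```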
Exercises#
Implement the unadjusted Rand index (RI) in two ways: pair counting and contingency table. Confirm they match.
Create two different partitions of the same data that have similar RI but different ARI. Explain why.
For imbalanced cluster sizes, run the random-labeling experiment again and see how RI behaves as you change the number of predicted clusters.
Replace k-means initialization with k-means++ and compare ARI and inertia across runs.
References#
scikit-learn: sklearn.metrics.adjusted_rand_score, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification.